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ABSTRACT 

The lattice-Boltzmann method (LBM) is an algorithm for CFD sim- 
ulations that has gained popularity due to its ease of implementa- 
tion and suitability for complex geometries. Its scalability on mul- 
ticore chips is often limited due to its low computational inten- 
sity, leading to interesting characteristics regarding optimal perfor- 
mance and energy to solution on the chip and highly parallel levels. 
In this paper we perform a thorough analysis of a two-relaxation- 
time (TRT) model in a sparse lattice representation on the Intel 
Sandy Bridge processor. Starting from a single-core performance 
model we can describe the intra-chip saturation characteristics of 
the implementation and its optimal operating point in terms of en- 
ergy to solution as a function of the propagation method, the clock 
frequency, and the SIMD vectorization. We then show if and how 
these findings may be extrapolated to the massively parallel level 
on a petascale-class machine, and quantify the energy-saving po- 
tential of various optimizations. 
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1. INTRODUCTION 

For many years, high performance computing focused on optimiz- 
ing the sustained performance only, i.e., to make a code faster to 
reduce the time to solution. Owing to the increasing energy con- 
sumption and energy costs of HPC systems, additional metrics like 
energy to solution are brought into focus. Adapting the processors' 
clock frequency to the needs of an implementation are one way to 
influence the energy consumption of a compute node. However, the 
complex nature of today's compute nodes with typically at least two 



sockets and multi-core processors as well as the different perfor- 
mance bottlenecks - like memory bandwidth, instruction through- 
put, and arithmetic units - and their saturation behavior make an a 
priori prediction difficult. Some of these bottlenecks, in particular 
the sustained memory performance, often depend on the proces- 
sors' clock frequency. 

In this paper we use a lattice-Boltzmann method (LBM) from com- 
putational fluid dynamics (CFD) to perform a thorough analysis of 
performance and energy to solution on the chip and highly paral- 
lel levels. Starting from the a single-core execution-cache-memory 
(ECM) performance model we can describe the intra-chip satura- 
tion characteristics of the implementation and its optimal operating 
point in terms of energy to solution as a function of the propagation 
method, the clock frequency, and the SIMD vectorization. 



1.1 Related work 

The roofline model of WILLIAMS et al. [ ] provides valuable in- 
sight into how much performance can be achieved with regard 
to the typical limiting factors memory bandwidth and arithmetic 
throughput. This allows for a first assessment of how far the perfor- 
mance of the code at hand deviates from the maximum sustainable 
performance (the "light speed"). 

Performance modeling and prediction especially in the context of 
LBM are an ongoing research topic of many groups in engineer- 
ing and computer science [ , ]. Auto-tuning was used, e.g., in [ ] 
for a magnetohydrodynamics LBM. The ILBDC LBM code we use 
in the present work has been optimized previously [ ] and its sus- 
tained performance is close to predictions from performance mod- 
els [6]. 

Research in the direction of energy-saving hardware and software 
mechanisms focuses on models and algorithms for dynamic volt- 
age and frequency scaling (DVFS) and dynamic concurrency throt- 
tling (DCT) [ ]. With frequency scaling not only the computational 
performance of a core changes, but as a side effect also cache and 
memory bandwidth can be influenced [ , ]. Any realistic model- 
ing effort must take these effects into account. The ECM model 
[10, 8] is a refinement of the roofline model and allows a more ac- 



curate prediction of tlie intra-chip scaling properties of a parallel 
code. In [ ] we have used it to model simple streaming kernels and 
a specific variant of an LBM solver. Together with a simple power 
model derived for the Sandy Bridge chip we were able to explain 
the performance saturation and energy to solution characteristics of 
this solver The analysis in this paper goes much deeper, and con- 
siders a more advanced propagation method. It also extends beyond 
the single chip to fathom the energy and performance properties of 
the solver in highly parallel runs. 

1.2 The lattice Boltzmann method 

Lattice-Boltzmann methods have become a popular approach in 
computational fluid dynamics. However, they are also interesting 
for computer scientists as the core algorithm is rather short, uses a 
larger Jacobi-like stencil with vector instead of scalar data result- 
ing in many concurrent memory streams and no reuse of data in a 
single iteration, but is straightforward to parallelize due to simple 
next-neighbor communication. 

In the present work we employ the ILBDC code [ ], which uses a 
D3Q19 lattice model and implements a two-relaxation-time (TRT) 
collision model [ ]. All calculations are performed in double- 
precision floating point arithmetic. The algorithm with the D3Q19 
model can be viewed as a 19-point stencil in 3-D accessing only 
nearest neighbors, but has two important differences to common 
stencil algorithms: (i) each lattice node consists not only of one, 
but of 19 values (so called particle distribution functions [PDFs]); 
(ii) each PDF accessed is only accessed again in the next time 
step, which prevents data reuse (unless complex temporal blocking 
schemes are employed). 

The performance of a given LBM approach depends at least on the 
data layout and memory access patterns, the scale of arithmetic op- 
erations (i.e., how well numerical expressions are simplified and 
combined to avoid unnecessary operations), and their degree of 
SIMD vectorization. A thorough overview of more propagation- 
step implementations and their memory access characteristics can 
be found in [12, 6]. 

It seems natural to store the PDFs in a 4-D array with an additional 
flag field, which is used to distinguish fluid and solid nodes. This 
is known as the marker-and-cell approach. However, LBM simu- 
lations of domains with a large fraction of solid nodes can benefit 
from a sparse representation of the domain [ L3, 14, L5, 16, 17, 18], 
where only the fluid nodes are kept in a 1-D vector. Indirect ac- 
cesses to PDFs of neighboring nodes are then required and accom- 
plished through an adjacency list (IDX), which represents the topo- 
logical connections of the nodes. ILBDC uses such a sparse repre- 
sentation. 

For updating one node, optimized implementations read one PDF 
from each of the 19 surrounding neighbors (streaming step), com- 
pute updated values (collision step), and write the results to the 
PDFs of the local node. This is known as the pull scheme [19]. It 
is implemented in the ILBDC code together with a structure-of- 
arrays (SoA) data layout where all PDFs of a direction are stored 
consecutively in memory before the next direction follows. 

To work around the data dependency problems of a combined 
stream-collide step, two lattices are often used, one as the source 
and one as the destination. Then, a fluid lattice-node update (FLUP) 
requires 19 PDF loads, 19 additional PDF loads because of write- 
allocate transfers, 19 PDF stores and 18 IDX loads of the adjacency 



list. Assuming double-precision floating-point numbers (eight 
bytes) for PDF and four-byte integers for IDX, the total number of 
bytes that must be transferred between CPU and memory for one 
FLUP is 3 X 19 X 8 (PDF) +18x4 (IDX) bytes = 528 bytes. The 
write-allocate is triggered by the cache hardware on store misses 
and loads the data at the accessed location (the complete cache line, 
to be exact) from memory into the cache before it is updated. With 
non-temporal store instructions these write-allocates are avoided 
and the data is directly written from the processor into memory, 
bypassing the cache hierarchy. The number of bytes required for 
one FLUP decreases then to 2 x 19 x 8 (PDF) +18x4 (IDX) bytes 
= 376 bytes. Current standard processors have difficulties with 
19 concurrent write streams, in particular if they consist of non- 
temporal stores. As a remedy, blocking can be applied so that a 
node's PDFs are read in chunks and updated values are stored in 
a small temporary buffer, which should be small enough to fit in 
the LI cache. From this buffer, two directions of the updated PDFs 
at a time are written with non-temporal stores to the destination 
lattice. We call this implementation pull-split no-NT or pull-split 
NT, depending on whether normal stores or non-temporal stores 
are used. 

Bailey's AA pattern ["Ti] for the PDF access allows using one 
single lattice only (instead of separate source and destination grids) 
while maintaining the possibility to update all cells in any order 
and in parallel. The iterations over the lattice are divided into even 
and odd time steps. During an even step only PDFs of the cur- 
rent node are accessed in each lattice site update. At the follow- 
ing odd step only PDFs of the neighboring nodes are accessed, 
which requires indirect addressing in our case of the sparse rep- 
resentation. With this update scheme only stores to locations in 
memory occur which have previously been read. No write-allocate 
will be performed as the data to be updated already resides in the 
cache. We use an optimized version where the even time step is 
completely SIMD-vectorizable (SSE/AVX), which can easily be 
accomplished as all PDFs are accessed consecutively and no in- 
direct access is required. In the odd time step a partial vectoriza- 
tion is performed, which can avoid the indirect addressing and al- 
lows for vectorized execution of consecutively stored chunks of 
PDFs. Nodes which cannot be treated in this way are regularly up- 
dated without SIMD vectorization (i.e., in scalar mode). The frac- 
tion of nodes which can be updated with SIMD operations depends 
on the geometry used for the simulation. During even time steps 
2 X 19 X 8 (PDF) bytes = 304 bytes per FLUP are required. In the 
odd time step the number of bytes needed for one FLUP depends 
on the fraction of vectorizable updates. The lower bound is the case 
when all updates can be vectorized. Here only 2 x 19 x 8 bytes 
— 304 bytes are required. The upper bound is reached when all 
updates must be scalar and indirect accesses are required, which 
results in 2 X 19 X 8 (PDF) +18x4 (IDX) bytes = 376 bytes. 

1.3 Test bed and benchmark cases 

SuperMUC [ ], which is installed at Leibniz Supercomputing 
Center (LRZ)' in Garching near Munich, is a tier-0 PRACE" 
system and one of the main federal compute resources in Ger- 
many. It is built from a number of 512-node "islands," with a fully 
non-blocking fat tree FDR 10 InfiniBand connectivity inside each 
island. A compute node comprises two Intel Sandy Bridge ("SNB") 
EP (Xeon E5-2680) eight-core processors with a base clock fre- 
quency of 2.7 GHz. The actual clock speed of the processors can 

http: //www. lrz.de/english/ 
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Figure 1 : Visualization of the packed bed reactor geometry used in 
the benchmarks. 



be influenced by a so-called "energy tag," which is supplied upon 
LoadLeveler job submission, together with a parameter specifying 
how much performance degradation the user wants to tolerate for 
their job. A heuristic-based mechanism based on hardware perfor- 
mance counter measurements of the user's previous jobs with the 
same energy tag then sets the clock frequency for the job. Since 
SuperMUC does not easily allow an arbitrary frequency setting, 
and the "turbo mode" feature of the Intel processors cannot be used 
at all, all single-node benchmark tests were run on a standalone 
Sandy Bridge EP node with the same type of CPU and otherwise 
similar characteristics. 

The Intel Fortran/C compiler 13.1 and Intel MPI 4.1 were used in 
all cases. The operating system on SuperMUC is SuSE SLESll. 
Each MPI process was explicitly pinned to its physical core using 
sched_setaf f inity within the code. All benchmarks were run 
inside a single island to guarantee that communication is performed 
through the fully non-blocking fat tree. 

Two geometries were selected for the benchmarks in this paper. 
The first is an empty channel which consists only of fluid nodes 
except for the walls. The second geometry is a packed bed reactor, 
i.e., a tube filled with spheres (see Fig. 1). It represents a real world 
application case for flow simulation with this type of code. Both 
geometries have dimensions of 4000 x 80 x 80 nodes, resulting in 
25 • 10*' fluid nodes (f% 3.8 GB lattice + 1 .8 GB adj. list) and 19 • lO*" 
fluid nodes (~ 2.9 GB lattice + 1.4 GB adj. list) for the channel and 
the reactor geometry, respectively. 

With these dimensions both geometries fit into the NUMA local- 
ity domain of one socket on SuperMUC. For strong scaling runs 
the reactor geometry was enlarged to 8000 x 160 x 160 nodes, as 
the smaller lattice fits in the L3 caches of 128 compute nodes and 
above. The large reactor geometry consists of 157 • 10^ fluid nodes 
and requires around 24 GB of memory for the lattice and around 
1 1 GB for the adjacency list. 

The ILBDC code is purely MPI-parallel. All single-node measure- 
ments were thus performed with intra-node MPI only; we expect 
no major differences for an equivalent OpenMP-parallel version. 
Details abut the MPI parallelization can be found in Sect. 4.1. 

2. ECM PERFORMANCE MODEL 

The ECM model [ , ] is a refinement of the roofline model [ , 
23]. It starts with an analysis of the serial loop instruction code 
to get an estimate for the "applicable peak performance," i.e., the 
expected maximum performance when all data comes from the LI 
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Figure 2: Intra-socket strong scaling of the AA pattern LBM imple- 
mentation for an empty channel, comparing AVX, SSE, and scalar 
code at the clock frequencies of 2.7 GHz (solid lines) and 1.2 GHz 
(dashed lines). The corresponding memory bandwidths are indi- 
cated for selected cases. 



cache. After that, all required data transfers needed to get the data 
in and out of LI are accounted for, together with the time it takes to 
move the cache lines up and down through the memory hierarchy 
levels. Pure streaming is assumed, i.e., there are no latency effects 
and the prefetching mechanisms work perfectly. 

Both steps require some knowledge about the architecture, such 
as instruction latency and throughput, data path widths, and the 
achievable memory bandwidth. If some of this data is unavailable, 
suitable microbenchmarks or tools can be employed. Especially for 
the in-core performance analysis the "Intel Architecture Code An- 
alyzer" (lACA) [ ] is instrumental in getting a better view of the 
relevant bottlenecks. All data transfers between the memory hier- 
archy levels occur in packets of one cache line; hence, the ECM 
model always considers a "unit of work" of one cache line's length. 
If, e.g., a loop accesses single precision floating-point arrays with 
unit stride on a CPU with 64-byte cache lines, the unit of work is 
sixteen iterations. 

Since it is sometimes hard to predict which of those contributions 
to the serial runtime of a loop can overlap, the ECM model gener- 
ates a prediction interval based on no-overlap and full-overlap as- 
sumptions, respectively. In order to get the scaling behavior across 
the cores of a multi-core chip it is assumed that each core creates 
some bandwidth pressure on the relevant bottleneck, which is usu- 
ally a shared cache or the chip's memory interface. When the ag- 
gregate bandwidth pressure exceeds the available bandwidth, satu- 
ration sets in. 

2. 1 Observed chip-level performance and scal- 
ing 

As a motivation for doing this kind of analysis we show in Fig. 2 
the intra-socket scaling of the empty channel test case for the two 
"extremal" clock frequencies of 2.7 GHz and 1 .2 GHz, respectively, 
in three variants: AVX-vectorized (full 256-bit loads/stores), SSE- 
vectorized, and scalar. The data indicates that SIMD vectorization 
has a large impact in the serial case; in fact, the serial performance 
differs by more than a factor of two between the AVX and the 
scalar code. At 2.7 GHz, the gap closes as the number of cores is 
increased. On the full socket the scalar code is hardly 10 % slower 
than the AVX variant. The latter, however, reaches the same level 
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Figure 3: Intra-socket performance scaling for one SNB ciiip at 
2.7 GHz: AA pattern in AVX, SSE, and scalar variants (circles, 
squares, diamonds) and pull-split pattern with AVX vectoriza- 
tion with and without non-temporal stores (triangles), both for 
the empty channel application case (solid lines). The performance 
numbers for the packed reactor case are shown with dashed lines. 



already with four cores, which opens an opportunity for saving en- 
ergy by leaving cores idle. 

At 1.2 GHz, the situation in the serial case is similar, but on a lower 
level. The single-core performance of all code variants is roughly 
proportional to clock speed. However, only the AVX-vectorized 
code shows a saturation pattern, while the SSE and scalar variants 
scale linearly up to eight cores without reaching a bandwidth bar- 
rier. Hence, lack of vectorization ("slow code") cannot be compen- 
sated by using more cores in this case. Moreover, the maximum 
memory bandwidth is correlated with the core clock frequency and 
varies by about 20 % across the full frequency range [' ]. 

In Fig. 3 we show a socket-level performance comparison of the 
scalar and vectorized AA pattern implementation with the pull-split 
pattern for both application cases (empty channel vs. packed reac- 
tor) at a clock speed of 2.7 GHz. Although there is a large fraction 
of obstacles in the packed reactor geometry, their presence hardly 
influences the performance, independent of the propagation pattern. 
We also see that the pull-split pattern is not competitive, since it 
cannot by far saturate the memory bandwidth of the chip. 

The intention of applying the ECM model is to gain deeper insight 
into this performance behavior, and to pave the way for a practically 
useful energy consumption analysis. 

2.2 In-core analysis 

An lACA throughput analysis for the AA pattern kernel shows that 
the ADD port of the SNB core is the sole bottleneck of core exe- 
cution for all variants (scalar, SSE, AVX), as well as for even and 
odd time steps, and that one vectorized unit of work (eight lattice 
site updates) should take about 135 cycles. In contrast, a critical 
path analysis reports somewhat longer execution times due to de- 
pendencies in the instruction and data flow. The critical path de- 
pends on the type of time step and has a maximum length of 163 
cycles (even, AVX), 187 cycles (odd, scalar), or 212 cycles (odd, 
AVX). This prediction coincides with direct measurements, which 
we will use as an input in the following (160, 160, and 212 cycles, 
respectively). The analysis for the packed reactor case is surpris- 



Listing 1 : Parallel multi-stream update benchmark with 19 streams. 
N is chosen such that the arrays do not fit in any cache. 

double aOl [N] , a02 [N] , . . . , al9 [N] , s=2.0; 
#pragma omp parallel for 
forCint 1=0; i<N; ++i) i 
aOl [i] = s * aOl [i] ; 
a02 [i] = s * a02 [i] ; 



al9 [i] 



s * al9 [i] ; 



ingly very similar: the even time step does not change at all, since 
no index access is required. In the odd time step, even when as- 
suming no potential for vectorization (as would be the case for an 
extremely porous geometry) there is ample room for hiding the ad- 
ditional loads for the index array due to the bottleneck on the ADD 
port. This step is necessarily scalar, however, so the execution time 
is about 4 times longer per unit of work. The actual impact of this 
slowdown depends on the fraction of vectorizable updates. In our 
applications, this fraction is roughly 97 % for the empty channel 
and 92 % for the packed reactor case. This leads to a very small 
performance penalty for the latter, which was already observed in 
Fig. 3. Hence, we will only consider the empty channel case for the 
rest of the chip-level analysis. 

2.3 Data transfers and saturation behavior on 
the chip 

The ECM model requires the maximum attainable memory band- 
width as an input parameter. It is known that this value depends 
on the number of parallel read/write streams as well as the CPU 
clock speed. From a data transfer perspective, the AA-pattem im- 
plementation of the D3Q19 LBM algorithm reads 19 arrays from 
memory, modifies their contents, and writes them back. In order 
to get the maximum memory bandwidth on the socket we hence 
use a parallel multi-stream array update benchmark (see Listing 1). 
It is designed to mimic the data streaming behavior of the LBM 
algorithm. 

Figure 4 shows the achieved memory bandwidth on one SNB 
socket with varying number of threads (cores) and clock frequen- 
cies between 1.2 GHz and 2.7 GHz (plus turbo mode). As predicted 
by the ECM model, the single-thread performance is proportional 
to the clock speed, and the saturation point is shifted to larger thread 
counts as the clock speed decreases: While saturation is reached 
near three cores with turbo mode, up to six cores are needed at the 
lowest frequencies. Due to the large number of read/write streams, 
the maximum bandwidth is significantly lower than with a stan- 
dard single-stream update kernel (dashed line in Fig. 4). At the 
same time, the maximum (saturation) memory bandwidth drops 
by about 25 % over the whole frequency range; there is another 
substantial drop when using the full socket (eight cores) at the low- 
est frequency. As of now we have no conclusive explanation for 
these latter effects. They do, however, influence the considerations 
on energy dissipation, which will be discussed in Sect. 3. In the 
following we will use the maximum bandwidths as measured at 
the respective frequencies as an input to the ECM model in order 
to calculate the number of cycles required to transfer cache lines 
between memory and L3 cache. Data transfers between adjacent 
cache levels are assumed occur at 32 bytes per cycle, so these 
bandwidths are linear in the clock frequency. 




Figure 4: Multi-stream update benchmark performance scaling on 
one SNB socket with different CPU frequency settings. 19 update 
streams were run per thread. The dashed line indicates the maxi- 
mum achievable bandwidth with a simple single-array update ker- 
nel. 



2.4 Validation of the performance model 

The various execution and data transfer times that go into the ECM 
model in Fig. 5 may be combined in different ways to arrive at a 
performance prediction for the serial program: 



1 . The most conservative assumption is that none of those con- 
tributions overlap with each other, so that the execution time 
is equal to their sum. 

2. The most optimistic assumption is that the cycles in which 
the LI cache is occupied by loads and stores from the core 
cannot be used for reloads and evicts to L2, but all other con- 
tributions overlap. 
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Figure 5: Single-core ECM model of the AA propagation pattern 
for D3Q19 LBM (eight FLUPs). Even and odd time steps have 
different in-core timings. One arrow represents the number of full 
cache line transfers indicated; dashed arrows stand for half-wide 
(32-byte) transfers and are required for loading the adjacency in- 
formation in the odd time step when vectorization is not possible. 
One half-wide cache line transfer takes one cycle. 




ECM AVX LI overlap 
ECM AVX best 
ECM AVX worst 
ECM scalar worst 



Figure 6: Performance of the AVX (diamonds), SSE (squares), and 
scalar (circles) implementations of the LBM algorithm with AA 
propagation (empty channel case). The ECM model predictions for 
AVX with full overlap assumption (solid line), no overlap (dashed 
line), and partial overlap at LI (grey) are shown for comparison. 
The two lower data sets show the performance of the pull-split ver- 
sion with and without NT stores. 



Lastly one may assume that the pure in-core execution part 
(everything except loads and stores) can overlap with loads 
and evicts from/to the L2 cache, but that there is no overlap 
beyond that. 



In the following we restrict the discussion to the most relevant case. 

Figure 6 shows a comparison of the measured performance for the 
AVX-vectorized AA pattern implementation (diamonds) with the 
three models described above. Apart from the region around the 
saturation point (3-4 cores), the third assumption seems to provide 
the best fit to the data. In the scalar case (circles), the no-overlap 
assumption, although it is within a 10% margin to the measure- 
ments, is still too optimistic, which suggests that either there are 
additional overheads in the memory hierarchy that the model does 
not describe, or that one of the assumptions of the model, e.g., per- 
fect streaming without latency effects, is not (fully) satisfied due to 
the slower rate of requests into the cache hierarchy at scalar execu- 
tion. 

For completeness the picture also shows the AVX implementation 
of the pull-split propagation pattern with and without non-temporal 
stores (triangles). Although the pull-split NT version has almost the 
same computational intensity as the AA pattern, it is not competi- 
tive. This failure can mainly be attributed to the fact that the pull- 
split variant cannot be efficiently SIMD-vectorized on the Sandy 
Bridge architecture due to the indirect access in every lattice site 
update. More specifically, the loop which loads the neighboring 
distribution functions and stores intermediate results in to tempo- 
rary buffers is scalar. The loops which write the updated data back 
to the grid for the new time step can be vectorized, which makes 
the use of non-temporal stores possible. There is also a slight fur- 
ther degradation of achievable memory bandwidth as measured by 
a 19-stream array-copy benchmark, but, as Fig. 6 shows, this bot- 
tleneck does not apply due to the low single-core performance with 
pull-split. 

For these reasons we will ignore pull-split from now on and focus 
the following discussion on the AA pattern. 



3. POWER MODEL 

Recently, considerations of power dissipation and energy to solu- 
tion have received much interest in the supercomputing community. 
The two prevalent questions are: (i) How can a parallel code be run 
so that its overall energy consumption until a solution is reached 
can be minimized, preferably under the constraint of constant time 
to solution? and (ii) How can a parallel computer be operated in a 
production environment so that overall power dissipation is mini- 
mized or kept below a given maximum? 

We concentrate on the first question here. In [■"'] we have established 
a simple power model which is able to explain many of the peculiar 
properties of the energy-to-solution metric on a multicore proces- 
sor for load-balanced codes that may show some performance sat- 
uration as the number of cores used is increased. In the following 
we briefly summarize its derivation and the most important conclu- 
sions drawn from it. 

3.1 Power versus clock speed and energy to 
solution 

Direct measurements on the Sandy Bridge chip [8] with the RAPL 
interface [^', ^(] and the likwid-powermeter tool [27, 28] show an 
expected strong growth of power dissipation with rising clock fre- 
quency. Although it is not clear what the actual dependence should 
be, measurements suggest a power law with an exponent between 2 
and 3. In the following we assume a quadratic behavior so that the 
power dissipation is 



w{f,t)^wo+(wif+w2fy, 



(1) 



where / is the clock frequency and / is the number of cores used. 
Since W\ is usually very small we neglect it in the following. The 
part which is linear in / is the dynamic power dissipation, whereas 
Wo is the baseline power. Wq may include contributions from the 
rest of the system, such as memory, I/O circuitry, network, disks, 
and cooling. Wq and W2 are determined by suitable benchmark 
codes; here we choose W2 ~ IW/GHz , Wq ~ 23W for the chip 
level, and Wq ~ 73W for the whole system (per socket). Note that 
we do not expect these values to be exact; they are sufficient, how- 
ever, for a qualitative analysis. 

Energy to solution is proportional to the ratio of power dissipation 
to performance (for constant problem size). We assume that multi- 
core performance can be modeled by 



P{t) =mm( jtPo,Pn, 



(2) 



where /o is the chip's base frequency, Pq is the serial performance 
at /(), and P^ax is a maximum performance. Pm^x may be set by the 
presence of some bottleneck such as the memory bandwidth, or it 
may be infinite if the code is perfectly scalable. Hence, the energy 
to solution is 

^_Wo+{Wif + W2f)t 



l(^f/'0,/'max) 



(3) 



It is a simple consequence from this model that a minimum en- 
ergy to solution is reached at the performance saturation point, 
i.e., at the smallest number of cores for which P{t) = Pmax- If by 
some means P^^x can be increased, e.g., by choosing a propagation 
method which is able to make use of the full memory bandwidth 
(such as the AA pattern), energy to solution will immediately go 
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Figure 7: Energy to solution vs. performance of the AVX- 
vectorized (solid lines) vs. scalar (dashed lines) LBM AA pattern 
implementation (empty channel case) of one SNB socket for differ- 
ent clock frequencies (symbols). The number of cores used is the 
parameter along each data set. (a) Predictions by the ECM perfor- 
mance model and the chip-level power model, (b) Measured data. 
The shaded area is the region defined by absolute minimum energy 
and saturated performance for the AVX versions. The dashed line 
is the line of constant energy-delay product that hits the saturation 
point of the lowest-frequency run. 



down proportionally. In case the available number of cores is too 
small to get saturation, i.e., if the performance scales well, all cores 
must be used for minimum energy consumption. 

The single-core performance Pq also appears only in the denomi- 
nator. In the application cases we consider here, it is mainly influ- 
enced by the SIMD vectorization and thus, indirectly, by the choice 
of propagation pattern. The larger Pq, the fewer cores are needed to 
reach saturation, so there is an inherent energy saving potential in 
having a fast single-threaded code. 

The dependence of E on the clock frequency is more complex: if 
there is no saturation there may be an optimal frequency /opt for 
which E is minimal: 



/c 



opt 



(4) 



If W() comprises only the chip's idle power, /o is in the range of 
accessible frequencies. However, a low frequency setting has the 
adverse effect of extending the time to solution, which may not 
be desirable. Cost models other than time or energy (such as the 
energy-delay product) may then be needed to determine whether 
this is acceptable. On the other hand, if Wq contains (the chip's 
share of) the baseline power of the complete node, /opt is beyond 
the highest possible setting, and "clock race to idle" is the strategy 
of choice. 

3.2 Energy to solution for the LBM solver on 
the chip 

The ECM model and the power model enable a combined analysis 
and a deeper insight into the energy and performance properties 
of the LBM algorithm. It has proven to be useful to plot energy 
to solution versus performance, with the number of cores used 
as a parameter within a data set for a specific frequency, SIMD 
vectorization variant, propagation method, or other property. This 
has been done in Fig. 7a for three different clock frequencies and 



turbo mode, using the AA pattern in AVX and scalar variants (SSE 
is omitted from now on for clarity). In turbo mode, each data 
point was computed using the maximum allowed frequency for 
each number of active cores. The corresponding measurements are 
shown in Fig 7b. Note that we always show energy to solution in 
arbitrary units, but the values shown are coherent for a specific 
problem size. 

The models are able to describe the qualitative features of energy 
and performance. The observed deviations are caused by (i) the 
inability of the ECM model to accurately describe the performance 
behavior in the vicinity of the saturation point, (ii) the inaccuracy in 
determining W2 and Wq, and (iii) the approximation of linear power 
behavior with respect to core count even with saturated codes like 
LBM at higher clock speeds. In addition, turbo mode does not fit 
perfectly into the model (3) since the SNB chip can operate beyond 
its thermal design power (TDP) for a limited amount of time [ ]. 
This is why the deviation from the measurements is especially large 
with turbo mode (right-pointing triangles in Fig. 7). 

Looking at the minimum energy point with respect to clock fre- 
quency and number of cores in the regime where performance is 
not saturated, we see that this point moves to smaller frequency as 
the core count goes up, as described by (4). In general, all other 
things being equal, a faster sequential code (AVX instead of scalar) 
saves energy. Comparing energy to solution for the AVX codes at 
their respective saturation points, we can identify an "optimization 
space" (shaded area in Fig. 7b), in which the desired optimal oper- 
ating point should be found. Depending on the emphasis one wants 
to put on energy minimization vs. minimal time to solution, this 
point may be in the lower left comer of the area. In this case one 
would use all cores at the lowest frequency (1.2 GHz, filled circles) 
and sacrifice about 20 % of performance compared to the right edge 
of the area, which is defined by the saturation point at higher fre- 
quencies (2.0 GHz to turbo mode). Another clear conclusion is that 
turbo mode is of no good use for the LBM implementations studied 
here, neither from a performance nor from an energy point of view. 

There is no single, well-defined criterion for identifying the optimal 
operating point on the chip level. One may certainly employ cost 
models such as the energy-delay product (ratio of energy and per- 
formance), but this is only one possible choice. For reference we 
have included a line of constant energy-delay product in Fig. 7b. 
From the data we have collected, using 5-6 cores at 2.0-2.3 GHz 
seems to provide a good compromise between performance loss 
and energy consumption ("as far on the lower right as possible"). 

While the model and the measurements yield a consistent picture 
on the chip level, it is clear that the chip constitutes only a (however 
significant) part to the overall power consumption of a compute 
node. As mentioned above, when assessing the real energy demand 
for running an application, the rest of the system should be taken 
into account. We do this by setting Wq = 73W for the chip-level 
baseline power, which amounts to roughly 300 W of node power 
(assuming two-socket nodes). This is also the value measured dur- 
ing a LINPACK run on SuperMUC [_:v]. With this change we can 
offset the energy measurements from Fig. 7 to arrive at the data 
shown in Fig. 8. 

As expected, the modified baseline power leads to a reduction of 
the vertical spread between the measurements for different clock 
frequencies. While it was possible with the chip-level (i.e., small) 
Wo to have a situation where energy to solution was heavily in- 



0- 

•- 


-0 

-• 


1.2 GHz scalar 
1.2 GHz AVX 


A- 


-A 


2.0 GHz scalar 




—A 


2.0 GHz AVX 


V- 


-V 


2.7 GHz scalar 
2.7 GHz AVX 


>- -> Turbo scalar 


►- 


— ► 


Turbo AVX 




40 50 60 70 80 
Performance [MFLUP/s] 



90 100 110 120 



Figure 8: Same data as in Fig. 7b but with a power baseline of 50 W 
added to the socket. The circle marks a possible optimal operating 
point for almost minimal energy with a tolerable loss in perfor- 
mance. For reference, the best data for the pull-split propagation 
pattern (vectorized, full socket) for 1.2, 2.0, and 2.7 GHz is shown. 



fluenced by frequency and SIMD vectorization even at a specific 
performance level (with a spread of up to 2 x within the optimiza- 
tion space shown in Fig. 7), a large Wq reduces the spread to about 
25 %. Hence, a large baseline power favors the "race to idle" prin- 
ciple where the most influential parameter is performance; opti- 
mizations that favor a larger saturation performance (such as the 
AA propagation pattern, or blocking schemes which increase the 
computational intensity) have the most potential for saving energy. 
In addition, optimized clock speed and a reduction of the number 
of cores used can yield second-order but still significant savings. 
Within the transformed optimization space (shaded area in Fig. 8) 
we can identify a possible optimal operating point at about 2.0 GHz 
and six cores, with almost minimal energy to solution and a perfor- 
mance loss of about 6 % compared to the highest possible satura- 
tion level. In comparison to a naive strategy of running on all cores 
with turbo mode enabled and a scalar kernel, more than one third 
of the energy can be saved. 

The "race to idle" principle with respect to maximum code perfor- 
mance is evident from a comparison with the energy-performance 
data for the pull-split pattern (filled squares) in the best variant 
(SSE or AVX vectorized, non-temporal stores, full socket) at three 
different frequencies: The pull-split pattern can neither compete 
with AA in the performance nor in the energy dimension. Using 
AA, almost a factor of two in energy and 30^0 % of runtime can 
be saved in comparison to pull-split. 

4. HIGHLY PARALLEL LBM SIMULA- 
TIONS 
4.1 MPI parallelization in ILBDC 

ILBDC uses an MPI parallelization with a static load balancing 
scheme. The sparse representation of the lattice is cut into equally 
sized chunks, so that each MPI rank receives the same number of 
fluid nodes (probably off by one). The interfaces of such gener- 
ated partitions can be arbitrarily formed with different numbers of 
partition neighbors, as the simple cutting of the sparse representa- 
tion does not consider any topological information. However, in the 
case of the channel and reactor benchmark geometries this method 
results only in a 1-D decomposition, where each rank only needs to 
exchange ghost PDFs with its two direct neighbors. A more exten- 
sive description of this approach can be found in [ ,,]. We distribute 
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Figure 9: Parallel efficiency of the large packed bed reactor appli- 
cation case (8000 x 160 x 160 lattice nodes) for different frequency 
settings and different number of processes per chip (PPC) on up to 
128 nodes of SuperMUC. The efficiency calculation was based on 
the four-node performance baseline. 
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Figure 10: 1MB sendrecv benchmark results on two SuperMUC 
nodes for different CPU clock speed settings with 16 processes per 
node (filled symbols) and one process per node (open symbols). 
The mapping of MPI ranks to cores was set for minimum inter- 
node traffic (consecutive ranks on the cores of a node). The shaded 
area indicates the range of message sizes for the LBM application 
test case (packed bed reactor geometry). 



the ranks linearly across the compute nodes, so that consecutive 
ranks are located nearby on the same node. With this ordering only 
the first and the last rank on a compute node must perform inter- 
node communication. All the remaining ranks communicate within 
the node only. In the case of strong scaling the communication vol- 
ume of a process stays constant, since each rank only gets a smaller 
segment of the long geometries with an increasing number of MPI 
processes. 

The packed bed reactor geometry was used for all the multi-node 
experiments, since it is the application scenario that is relevant in 
practice. We have shown earlier that the node-level performance 
(and thus power) properties are very similar to the empty channel 
case. Note also that "turbo mode" cannot be activated on Super- 
MUC, so we stick to the fixed frequencies of 1.2, 1.7, 2.3, and 
2.7 GHz in the following. 

4.2 Performance and energy at strong scaling 

4.2.1 Parallel efficiency and communication perfor- 
mance 

All variants of the AA pattern scale well up to 32 nodes (512 cores) 
at all frequencies, and parallel efficiency only starts to degrade be- 
low 90 % beyond that point. Scaling experiments were performed 
on up to 128 nodes (2048 cores), since this is where some variants 
start to show efficiencies as low as 60 %. We assume a sensible limit 
of 50-60 % of parallel efficiency for production use in a computing 
center environment. A lower efficiency, which must be regarded 
as a waste of resources, should be justified by special needs, for in- 
stance when large aggregate memory is required. In Fig. 9 we show 
the parallel efficiency of the strong scaling runs versus the number 
of nodes at the four chosen frequencies and with between four and 
eight processors per chip. Since the application case is too large to 
fit on a single node, all efficiency numbers were normalized to the 
four-node run. 

Usually one would expect the parallel efficiency to increase as the 
node-level performance goes down, because communication and 
synchronization overheads become less important when the pure 
compute time goes up. On SuperMUC, the opposite is the case: The 



minimum parallel efficiency (at 128 nodes) varies between 76 and 
63 % (depending on the number of cores per chip) for 2.7 GHz, but 
between 69 and 61 % at 1.2 GHz. We conclude that there must be 
a frequency-dependent factor which impedes scalability whenever 
communication overhead plays a significant role. 

In order to explore the reasons for this effect we have conducted 
experiments with "sendrecv" from the Intel MPI benchmark suite 
(1MB) [ ], since it mimics to some extent the ringshift-like halo- 
exchange communication pattern of the ILBDC code. Each MPI 
process exchanges data with its neighbors: MPI_Sendrecv(to 
right neighbor, from left neighbor). The benchmark 
reports the available communication bandwidth per process. In 
Fig. 10 we show the results for two SuperMUC nodes in the two 
comer cases of one process (PPN=1) and 16 processes per node 
(PPN=16) for the two extremal frequencies of 1.2 and 2.7 GHz. 
The placement of the MPI ranks was done in the same way as for 
the ILBDC benchmarks: Neighboring ranks were "packed" to the 
same node as far as possible to minimize inter-node traffic in the 
PPN=16case. 

Although both scenarios show a dependence of the effective MPI 
bandwidth on the clock speed, this is especially pronounced at 
PPN=16, and we see a breakdown of about 40% in communica- 
tion bandwidth within the region of message sizes relevant for the 
ILBDC packed reactor benchmark (shaded area). Moreover, the 
bandwidth of the FDR- 10 IB interface cannot be saturated even 
at the highest frequency setting with PPN=16. We attribute both 
effects to the dominance of intra-node communication, which has a 
strong dependence on clock speed. In contrast, the saturated LBM 
performance with the AA pattern and AVX vectorization only 
drops by about 20 % over the whole frequency range (see Fig. 2). 
This explains the stronger breakdown of parallel efficiency with 
strong scaling at low clock speeds. 

4.2.2 Energy and performance at scale 
The question remains whether one can extrapolate the findings 
about energy to solution and performance from the chip to the 
multi-node level, and especially whether single-core optimizations, 
notable SIMD vectorization, have a similar impact. Figure 11 
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Figure 1 1 : Multi-node energy to solution vs. performance for the 
AA pattern AVX LBM implementation (large reactor case) for dif- 
ferent clock speeds and different node counts. The parameter along 
each curve is the number of processes per chip (4 . . . 8). For com- 
parison, the open symbols show data for the scalar implementation 
on full sockets. 



shows aggregated socket-level energy (as measured via RAPL) vs. 
performance with AVX for the three node counts (32, 64, and 128) 
at which parallel efficiency is between 90 and 60%. Along each 
curve, the number of processes per chip (PPC) is increased from 
four to eight, and the highest energy point at the top of each curve 
is at PPC=8. For reference we also show the corresponding lowest- 
energy data points for the scalar implementation (open symbols). 
The overall rise in energy to solution with growing node count is a 
trivial consequence of the decreasing parallel efficiency. 

The most striking difference to the chip-level results is the notable 
performance degradation after the saturation point, especially at the 
larger node counts (32 and 64). It is caused by the drop in ring- 
shift bandwidth (as described in the previous section) with growing 
PPC, and directly leads to a fast rise in energy to solution, much 
steeper than would be expected by the power model without com- 
munication component. Hence, it is even more crucial in the highly 
parallel case to select the optimal operating point, since each ex- 
pendable core costs an over-proportional amount of energy: At 128 
nodes and 2.7 GHz, the savings in energy consumption when going 
from the full socket to the saturation point is about 35 %, but only 
about 25 % on a single chip (see the 2.7 GHz data in Fig. 7b). 

The strong disadvantage of scalar execution is handed down from 
the chip to the highly parallel level, and intensified by the fact that 
more processes are needed to reach saturation, if this is possible at 
all. As a consequence, a well-vectorized LBM code is instrumen- 
tal for optimal energy to solution, particularly in the highly parallel 
case when communication plays a noticeable (but not dominant) 
role. It is evident that all differences in chip-level performance be- 
come negligible at very low parallel efficiency, when communica- 
tion overhead dominates the execution time. 

Finally we need to comment on how these findings change if a re- 
alistic baseline power is used. Figure 12 shows the same data as 
Fig. 1 1 but with 100 W of power added per node. The results are 
very similar to the chip-level discussion in Sect. 3.2 above: All dif- 
ferences in energy to solution are damped by the larger idle power, 
but there is still more than 30 % gain between a naive scalar code 
run with PPC=8 and the possible optimal operating point (marked 
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Figure 12: Same data as in Fig. 1 1 but with a realistic baseline 
power of 100 W added per node. 



in Fig. 12) with PPC=4 and 2.3 GHz. In contrast to the case where 
only the chip power is considered, the lowest frequency setting of 
1 .2 GHz is very unfavorable: The large performance degradation to- 
gether with the communication bandwidth breakdown problem and 
the large baseline power prohibit the use of very small frequencies, 
even if energy to solution were the only relevant metric. 

5. SUMMARY AND OUTLOOK 

We have analyzed the performance and energy to solution prop- 
erties of a lattice-Boltzmann flow solver on the chip and highly 
parallel levels for an Intel Sandy Bridge EP-based system. Differ- 
ent propagation patterns (pull-split vs. AA), SIMD vectorization 
schemes (scalar vs. SSE/AVX), and the dependence on the clock 
speed of the processors and the number of processes per chip were 
studied using the chip-level ECM performance model and a simple 
power model. In addition to the measured chip power, the (esti- 
mated) "true" system-level power was taken into account in order 
to aiTive at useful predictions for realistic scenarios. 

We found a high single-core performance to be the pivotal instru- 
ment for reaching minimal energy without sacrificing too much 
time to solution. AVX vectorization and the choice of a good prop- 
agation pattern are important components for accomplishing this 
goal. Technical measures such as clock speed reduction have less 
impact but still contribute considerably to the overall energy con- 
sumption. 

When a "good" single-core code has been found and the optimal 
clock speed has been set, it is the identification of the performance 
saturation point with respect to the number of processes on the chip 
level, but more importantly in the highly parallel case, which de- 
cides upon minimal energy to solution. In the latter case, due to the 
dependency of MPI communication bandwidth on the clock speed 
of the processor, using too many cores per chip must be avoided 
when communication plays a non-negligible role. 

The impact of the whole system's baseline power consumption (i.e., 
powered on but with idle cores) is twofold: It attenuates the differ- 
ences in energy to solution caused by technical measures such as 
clock speed adjustments, but it emphasizes the influence of bare 
chip-level performance and communication overhead, especially in 
the highly parallel case. Hence, having a "good" single-chip code 
is even more important when dealing with realistic, i.e., full-system 
power consumption and parallel production runs. 



This work opens the possibihty for future research in muUiple di- 
rections: We have not addressed realistic alternatives to the standard 
quality metrics of energy and time to solution in detail. One such 
cost function could be the energy-delay product, but there may be 
others. Moreover it will be interesting to compare our findings for 
a typical x86-based cluster to a modem low-power system such as 
the IBM Blue Gene/Q. Finally, the ECM model and the multicore 
power model need refinements and adjustments to be able to deal 
with more complex hardware and software scenarios such as the 
expected flexible (per-core) frequency settings on upcoming Intel 
processors. Also it would be worthwhile to explore the reasons for 
the deviation of the ECM model from measurements near the satu- 
ration point, which does not occur for all types of code [8]. 
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